9 research outputs found

    Korpus beranotasi: ke arah pengembangan korpus bahasa-bahasa di Indonesia [Annotated corpora: toward the development of corpora for the languages of Indonesia]

    Although Indonesia is known as the country with the second-greatest linguistic and cultural diversity in the world, after Papua New Guinea, it is ironically also known as a country with few electronic language resources. Ethnologue (Simons and Fennig 2018) lists 719 local languages in Indonesia, and building language resources (sumber daya bahasa, SDB) for all of them will inevitably take considerable time and money

    Indonesian lexical bundles in research articles: Frequency, structure, and function

    Recent studies show that lexical bundles in English are pervasive in academic discourse, and that their characteristics vary across registers and genres. Nevertheless, such research remains worthwhile in languages other than English. This study aims to describe the frequency, structure, and function of Indonesian lexical bundles in research articles. It adopted a mixed-methods approach. Lexical bundles were identified using WordSmith 7.0 on a corpus of 3,125,546 words, drawn from 1126 texts across six disciplines. With a frequency threshold of 40 per million words and a minimum distribution of 5 texts, 197 lexical bundles were obtained, consisting of three- to six-word bundles with a total of 51,813 occurrences. In terms of structure, incomplete structures dominate, accounting for 78.7% of the bundles, with a total of 38,749 occurrences. The bundles can be classified into five structural types: noun-based, prepositional-based, verb-based, adjective-based, and clause-based. Lexical bundles in research articles are generally clause-based (49.2%). This indicates that Indonesian lexical bundles vary in structure. Clause fragments and passive verbs are the main features of this genre. In terms of discourse function, research-oriented bundles are the most commonly used, while participant-oriented bundles are the least. Each discourse function has its own structural characteristics. One lexical bundle can also belong to two functional categories. These findings contribute to a better understanding of the characteristics of written academic discourse. From a pedagogical point of view, the findings can be used as learning material for both native and non-native speakers
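
The two selection criteria in this abstract (a normalized frequency threshold and a minimum number of texts) can be illustrated with a short sketch. This is not the WordSmith 7.0 procedure the study used; the function names `extract_bundles` and `raw_cutoff` and the pre-tokenized input format are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def extract_bundles(texts, n=3, per_million=40.0, min_texts=5):
    """Return {ngram: (count, number_of_texts)} for n-grams passing both cutoffs.

    texts: list of token lists, one per document (hypothetical input format).
    """
    total_words = sum(len(tokens) for tokens in texts)
    counts = Counter()
    text_range = defaultdict(set)  # n-gram -> indices of texts containing it

    for idx, tokens in enumerate(texts):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            counts[gram] += 1
            text_range[gram].add(idx)

    bundles = {}
    for gram, count in counts.items():
        # Normalize the raw count to occurrences per million words.
        rate = count * 1_000_000 / total_words if total_words else 0.0
        if rate >= per_million and len(text_range[gram]) >= min_texts:
            bundles[gram] = (count, len(text_range[gram]))
    return bundles


def raw_cutoff(corpus_words, per_million=40.0):
    """Convert a per-million threshold into a raw count for a given corpus size."""
    return per_million * corpus_words / 1_000_000
```

Under these assumptions, `raw_cutoff(3_125_546)` works out to roughly 125, so in a corpus of the size reported above a candidate sequence would need about 125 raw occurrences, spread over at least 5 texts, to qualify.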

    Dictionary 4.0: Alternative Presentations for Indonesian Multilingual Dictionaries

    Building a multilingual dictionary for 719 languages in Indonesia is a challenging task. We have developed an application to create a Leipzig-Jakarta list database for all indigenous languages in Indonesia. The database can be used to generate a lexical similarity or lexical distance matrix between languages by comparing their word lists. To start, we covered 11 languages: Indonesian, Javanese, Sundanese, Madurese, Bima, Ternate, Tidore, Palembang Malay, Mandailing Batak, Malay, and Minangkabau. The application has two main features: exploring the existing translations, and adding translations for a new language or editing existing translations through crowdsourcing. A user acceptance test yielded a score of 3.48/4
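
The abstract does not say how the lexical distance matrix is computed. As one possible reading, the sketch below averages a length-normalized Levenshtein distance over the concepts two word lists share. The metric, the function names, and the `{language: {concept: word}}` layout are all assumptions, not the paper's method or database schema.

```python
def levenshtein(a, b):
    """Classic edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer word's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def distance_matrix(word_lists):
    """word_lists: {language: {concept: word}} (hypothetical layout).

    Returns (languages, {(lang_x, lang_y): mean distance or None}).
    """
    langs = sorted(word_lists)
    matrix = {}
    for x in langs:
        for y in langs:
            shared = word_lists[x].keys() & word_lists[y].keys()
            if not shared:
                matrix[(x, y)] = None  # no concepts in common to compare
                continue
            total = sum(normalized_distance(word_lists[x][c], word_lists[y][c])
                        for c in shared)
            matrix[(x, y)] = total / len(shared)
    return langs, matrix


# Toy entries for illustration only; not taken from the paper's database.
sample = {
    "Indonesian": {"water": "air", "fire": "api"},
    "Javanese": {"water": "banyu", "fire": "geni"},
}
languages, distances = distance_matrix(sample)
```

A lexical similarity matrix could be derived from the same values as one minus the distance. Comparing surface strings this way ignores cognate judgments, so it is only a rough proxy for relatedness.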

    SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages

    This year's iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems' predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems' performance on previously unseen lemmas.

    UniMorph 4.0: Universal Morphology

    The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g. missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet